Chapter 10 - New Developments: Topic Modeling with BERTopic!
2022 July 30

What is BERTopic?#
As part of NLP analysis, it’s likely that at some point you will be asked, “What topics are most common in these documents?”
Though related, this question is definitely distinct from a query like “What words or phrases are most common in this corpus?”
For example, the sentences “I enjoy learning to code.” and “Educating myself on new computer programming techniques makes me happy!” contain wholly unique tokens, but encode a similar sentiment.
If possible, we would like to extract generalized topics instead of specific words/phrases to get an idea of what a document is about.
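To make the distinction concrete, here is a minimal sketch in plain Python that counts the tokens shared by the two example sentences. The `tokenize` helper is a deliberately naive, hypothetical tokenizer written for this illustration; it is not part of BERTopic.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens (naive tokenizer)."""
    return set(re.findall(r"[a-z]+", text.lower()))

sentence_a = "I enjoy learning to code."
sentence_b = "Educating myself on new computer programming techniques makes me happy!"

tokens_a = tokenize(sentence_a)
tokens_b = tokenize(sentence_b)
shared_tokens = tokens_a & tokens_b

# Jaccard similarity: size of the intersection divided by size of the union.
jaccard = len(shared_tokens) / len(tokens_a | tokens_b)

print(shared_tokens)  # set()
print(jaccard)        # 0.0
```

A word-overlap measure scores these sentences as completely unrelated, even though a human reader sees the same theme in both.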
This is where BERTopic comes in! BERTopic is a topic modeling technique that combines transformer-based document embeddings (the family of models that includes BERT) with other ML tools, such as dimensionality reduction, clustering, and a class-based TF-IDF weighting, to provide a flexible and powerful topic modeling module (with great visualization support as well!).
In this notebook, we’ll go through the operation of BERTopic’s key functionalities and present resources for further exploration.
Required installs:#
# Installs the base bertopic module:
!pip install bertopic
# If you want to use other transformers/language backends, it may require additional installs.
# Quote the package name so that shells such as zsh do not treat the square brackets as a glob pattern:
!pip install "bertopic[flair]" # can substitute 'flair' with 'gensim', 'spacy', 'use'
# bertopic also comes with its own handy visualization suite
# (plotly is already a dependency of the base package in bertopic 0.11.0, so this step may be unnecessary):
!pip install "bertopic[visualization]"
Data sourcing#
For this exercise, we’re going to use a popular data set, ‘20 Newsgroups,’ which contains ~18,000 newsgroup posts on 20 topics. This dataset is readily available to us through scikit-learn:
import bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
documents = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
print(documents[0]) # Any ice hockey fans?
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!
Creating a BERTopic model:#
Using the BERTopic module requires you to create an instance of the model. When doing so, you can specify several parameters, including:
language -> the language of your documents
min_topic_size -> the minimum size of a topic; increasing this value will lead to a lower number of topics
embedding_model -> the model used to embed your documents; many are supported!
For a full list of the parameters and their significance, please see https://github.com/MaartenGr/BERTopic/blob/master/bertopic/_bertopic.py.
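As a rough sketch of how the three parameters above could be passed together: the values below are illustrative choices rather than recommendations, and `bertopic_params` and `build_model` are hypothetical names introduced here. The string "all-MiniLM-L6-v2" is assumed to name a sentence-transformers model that your installation can download.

```python
# Keyword arguments for the BERTopic constructor; the values are illustrative only.
bertopic_params = {
    "language": "english",                  # language of the documents
    "min_topic_size": 20,                   # larger values tend to yield fewer, broader topics
    "embedding_model": "all-MiniLM-L6-v2",  # assumed name of a sentence-transformers model
}

def build_model(params):
    """Create an unfitted BERTopic model from a dict of constructor arguments."""
    # Imported inside the function so the parameter dict can be inspected
    # without loading the (fairly heavy) library.
    from bertopic import BERTopic
    return BERTopic(**params)

# custom_model = build_model(bertopic_params)  # uncomment once bertopic is installed
```

Keeping the arguments in a dict makes it easy to record which settings produced a given set of topics.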
Of course, you can always use the default parameter values and instantiate your model as model = BERTopic(). Once you’ve done so, you’re ready to fit your model to your documents!
Example instantiation:#
from sklearn.feature_extraction.text import CountVectorizer
# example parameter: a custom vectorizer model can be used to remove stopwords from the documents:
stopwords_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
# instantiating the model:
model = BERTopic(vectorizer_model=stopwords_vectorizer)
Fitting the model:#
The first step of topic modeling is to fit the model to the documents:
topics, probs = model.fit_transform(documents)
.fit_transform() returns two outputs:
topics contains the mapping of each input (document) to its modeled topic (alternatively, cluster)
probs contains, for each input, the probability that it belongs to its assigned topic
Note:
fit_transform() can be substituted with fit(), which fits the model in the same way but returns the fitted model itself instead of the topics and probabilities. To predict topics for new documents after fitting, use transform().
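The two outputs line up position by position with the input documents. The sketch below uses small made-up lists (`toy_documents`, `toy_topics`, and `toy_probs` are hypothetical stand-ins, not real model output) to show one way of grouping documents by their assigned topic:

```python
from collections import defaultdict

# Hypothetical stand-ins for the input and the two outputs of fit_transform().
toy_documents = ["Pens beat the Devils", "New encryption chip", "Jagr scores again", "asdf qwerty"]
toy_topics = [0, 1, 0, -1]            # one topic number per document
toy_probs = [0.92, 0.81, 0.64, 0.0]   # one probability per document

# Group each document (with its probability) under its assigned topic number.
docs_by_topic = defaultdict(list)
for doc, topic, prob in zip(toy_documents, toy_topics, toy_probs):
    docs_by_topic[topic].append((doc, prob))

for topic in sorted(docs_by_topic):
    print(topic, docs_by_topic[topic])
```

With real output, the same loop works on `documents`, `topics`, and `probs` as long as all three have the same length.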
Viewing topic modeling results:#
The BERTopic module has many built-in methods to view and analyze your fitted model topics. Here are some basics:
# view your topics:
topics_info = model.get_topic_info()
# get detailed information about the top five most common topics:
print(topics_info.head(5))
   Topic  Count                                       Name
0     -1   6646                     -1_file_use_need_using
1      0   1838                0_team_games_players_season
2      1    616              1_clipper_encryption_chip_nsa
3      2    527  2_cheek ken_ken huh_ignore art_huh ignore
4      3    452          3_israel_israeli_jews_palestinian
When examining topic information, you may see a topic with the assigned number ‘-1.’ Topic -1 collects all outlier inputs that were not assigned to any topic, and it should typically be ignored during analysis.
Forcing documents into a topic could decrease the quality of the topics generated, so it’s usually a good idea to allow the model to discard inputs into this ‘Topic -1’ bin.
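Before ignoring Topic -1, it can be worth checking how much of the corpus it holds. A minimal sketch, using a made-up list of assignments (`toy_topics` is hypothetical; with a fitted model you would use the `topics` list returned by fit_transform() instead):

```python
from collections import Counter

# Hypothetical topic assignments; -1 marks outliers.
toy_topics = [-1, 0, 0, 1, -1, 2, 0, -1]

topic_counts = Counter(toy_topics)
outlier_share = topic_counts[-1] / len(toy_topics)

# Positions of the documents that were assigned to a real topic.
kept_indices = [i for i, topic in enumerate(toy_topics) if topic != -1]

print(outlier_share)  # 0.375
print(kept_indices)   # [1, 2, 3, 5, 6]
```

The kept positions can then be used to select the matching documents for downstream analysis.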
# access a single topic:
print(model.get_topic(topic=0)) # .get_topics() accesses all topics
[('team', 0.007645058778587724), ('games', 0.006112662299637617), ('players', 0.005412026399964582), ('season', 0.005342811826876292), ('hockey', 0.005239065199444112), ('league', 0.004280045353200042), ('teams', 0.003990602953367509), ('baseball', 0.0037812052034601833), ('nhl', 0.003514144827427642), ('gm', 0.0029900018153221084)]
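Each entry returned by get_topic() is a (term, score) pair, ordered from highest to lowest score. Judging from the output shown in this notebook, the Name column of get_topic_info() appears to be the topic number followed by the four highest-scoring terms. The sketch below reproduces that label from the first five pairs printed above; `make_label` is a hypothetical helper, not a BERTopic function.

```python
# First five (term, score) pairs copied from the get_topic(0) output above.
topic_words = [
    ("team", 0.007645058778587724),
    ("games", 0.006112662299637617),
    ("players", 0.005412026399964582),
    ("season", 0.005342811826876292),
    ("hockey", 0.005239065199444112),
]

def make_label(topic_id, words, n_terms=4):
    """Join the topic number and its first n_terms terms with underscores."""
    return "_".join([str(topic_id)] + [term for term, _score in words[:n_terms]])

label = make_label(0, topic_words)
print(label)  # 0_team_games_players_season
```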
# get representative documents for a specific topic:
print(model.get_representative_docs(topic=0)) # omit the 'topic' parameter to get docs for all topics
["\ni have no idea, nor do i care. however, i'd like to point out that\nblomberg got the first plate appearance by a designated hitter, and\nthe first walk by a designated hitter. i am not sure, but i do not\nthink that he also got the first hit by a designated hitter.", ": >\n: >ATLANTIC DIVISION\n: >\t\n: >\tST JOHN'S MAPLE LEAFS VS MONCTON HAWKS\n: >\tMONCTON HAWKS\n: >See CD Islanders. Moncton is a very similar team to CDI. Low scoring,\n: >defensive, good goaltending. John Leblanc and Stu Barnes are the only\n: >noticable guns on the team. But the defense is top notch and \n: >Mike O'Neill is the most underrated goalie in the league.\n: >\n\n: Bri, as I have tried to tell you since 2 February, Michael O'Neill\n: might be the most underrated goalie in the AHL, but he ISN'T in the\n: AHL. He's on the Winnipeg Jets' injury list, as he has been since\n: his first NHL start against the Ottawa Senators. He's out until\n: next year after surgery to repair a shoulder separation.\n\n: Stu Barnes might be an AHL gun for the Hawks, but he's now the third\n: line center with the Jets, and has been since mid January or so.\n\nSorry, my memory is gone. I thought that O'Neill got sent back\ndown in February but I must have been given incorrect info. I guess\nthis says it all about Moncton because Barnes is still one of\ntheir top 3 or so scorers even though he's been out since January.", "\n\nI didn't see any smilies in this message so.......\n\n W T L PTs\n Team A 50 30 4 104\n Team B 52 32 0 104\n\n\nThere you go. Two teams that tie in points without identical records.\n\n"]
# find topics similar to a key term/phrase:
# a separate variable name avoids overwriting the `topics` list returned by fit_transform()
similar_topics, similarity_scores = model.find_topics("sports", top_n=5)
print("Most common topics:" + str(similar_topics)) # view the numbers of the top-5 most similar topics
# print the initial contents of the most similar topics
for topic_num in similar_topics:
    print('\nContents from topic number: '+ str(topic_num) + '\n')
    print(model.get_topic(topic_num))
Most common topics:[0, 30, 6, 166, 4]
Contents from topic number: 0
[('team', 0.007645058778587724), ('games', 0.006112662299637617), ('players', 0.005412026399964582), ('season', 0.005342811826876292), ('hockey', 0.005239065199444112), ('league', 0.004280045353200042), ('teams', 0.003990602953367509), ('baseball', 0.0037812052034601833), ('nhl', 0.003514144827427642), ('gm', 0.0029900018153221084)]
Contents from topic number: 30
[('games', 0.03260548961663573), ('sega', 0.02366315012814771), ('arcade', 0.012166539858844822), ('snes', 0.010883627526511617), ('sega genesis', 0.01081910740506706), ('joysticks', 0.010294764495945618), ('games sale', 0.010085068481475858), ('sale', 0.00964091677280479), ('joystick', 0.009006639792149954), ('sega cd', 0.0074012373591723)]
Contents from topic number: 6
[('riding', 0.011792240692170709), ('ride', 0.011256591323418531), ('driving', 0.007418204752466058), ('road', 0.007362304673149508), ('traffic', 0.006971330162717447), ('roads', 0.005093305390738552), ('bikes', 0.0046328368271995445), ('bikers', 0.0041220512073587194), ('riders', 0.0037367046265679754), ('passengers', 0.0035386604055364823)]
Contents from topic number: 166
[('religion', 0.024810151190057972), ('war', 0.01958713595572545), ('wars', 0.0141305144151792), ('crusades', 0.012827683749926261), ('history', 0.01202363443416338), ('religious', 0.009458363539211138), ('unbelievers', 0.008338773663764506), ('yoked unbelievers', 0.007970064155940823), ('statement religion', 0.007495172035922859), ('gods', 0.0071255212864334274)]
Contents from topic number: 4
[('health', 0.0072259305085357), ('cancer', 0.005975505039095839), ('disease', 0.00513078203584376), ('tobacco', 0.005069613472607038), ('medical', 0.00492433353954727), ('hiv', 0.004709304265420622), ('malaria', 0.004112010029452724), ('smokeless tobacco', 0.004033769948845448), ('lyme', 0.003923377448522405), ('medical newsletter', 0.003903230753928965)]
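For intuition about what "similar" means here, the sketch below ranks topics by cosine similarity between a query vector and one vector per topic. This is an assumption about the general approach, not a description of BERTopic's internals, and the three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embedding of the search term and of three topics.
query_vector = [1.0, 0.0, 0.0]
topic_vectors = {
    0: [0.9, 0.1, 0.0],
    4: [0.1, 0.9, 0.2],
    30: [0.6, 0.0, 0.8],
}

# Topic numbers ordered from most to least similar to the query.
ranked = sorted(
    topic_vectors,
    key=lambda topic: cosine_similarity(query_vector, topic_vectors[topic]),
    reverse=True,
)
print(ranked)  # [0, 30, 4]
```

Because the ranking depends only on direction in the embedding space, a loosely related topic can still appear in the top results when few topics are close to the query, as with topics 166 and 4 in the output above.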
Saving/loading models:#
One of the most obvious drawbacks of using the BERTopic technique is the algorithm’s run-time. But, rather than re-running a script every time you want to conduct topic modeling analysis, you can simply save/load models!
# save your model:
# model.save("TAML_ex_model")
# load it later:
# loaded_model = BERTopic.load("TAML_ex_model")
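One way to avoid refitting by accident is a small "load if cached, otherwise fit and save" helper. The sketch below shows the pattern with a stand-in model (a dict stored as JSON) so that it runs without BERTopic; `load_or_fit` and the `fake_*` functions are hypothetical names introduced for this illustration.

```python
import json
import tempfile
from pathlib import Path

def load_or_fit(path, fit, save, load):
    """Return the model cached at `path` if it exists; otherwise fit, save, and return it."""
    if Path(path).exists():
        return load(path)
    model = fit()
    save(model, path)
    return model

# Stand-in "model" so the pattern can be run anywhere.
fit_calls = []

def fake_fit():
    fit_calls.append(1)  # record that the expensive step ran
    return {"n_topics": 3}

def fake_save(model, path):
    Path(path).write_text(json.dumps(model))

def fake_load(path):
    return json.loads(Path(path).read_text())

with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "toy_model.json"
    first = load_or_fit(cache_path, fake_fit, fake_save, fake_load)   # fits and saves
    second = load_or_fit(cache_path, fake_fit, fake_save, fake_load)  # loads from disk

print(len(fit_calls))  # 1
```

With BERTopic, the three callables could plausibly be `lambda: BERTopic().fit(documents)`, `lambda m, p: m.save(p)`, and `BERTopic.load`, with the path given as a string such as "TAML_ex_model".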
Visualizing topics:#
Although the prior methods can be used to manually examine the textual contents of topics, visualizations can be an excellent way to succinctly communicate the same information.
Depending on the visualization, it can even reveal patterns that would be much harder, or even impossible, to see through textual analysis, like inter-topic distance!
Let’s see some examples!
# Create a 2D representation of your modeled topics & their pairwise distances:
model.visualize_topics()
# Get the words and probabilities of top topics, but in bar chart form!
model.visualize_barchart()
# Evaluate topic similarity through a heat map:
model.visualize_heatmap()
Conclusion#
Hopefully you’re convinced of how accessible but powerful a technique BERTopic topic modeling can be! There’s plenty more to learn about BERTopic than what we’ve covered here, but you should be ready to get started!
During your adventures, you may find the following resources useful:
Original BERTopic Github: https://github.com/MaartenGr/BERTopic
BERTopic visualization guide: https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-terms
How to use BERT to make a custom topic model: https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6
Recommended things to look into next include:
how to select the best embedding model for your BERTopic model;
controlling the number of topics your model generates; and
other visualizations and deciding which ones are best for what kinds of documents.
Questions? Please reach out! Anthony Weng, SSDS consultant, is happy to help (contact: ad2weng@stanford.edu)